57 research outputs found

Judicious Thread Migration When Accessing Distributed Shared Caches

Author: Devadas Srinivas
Khan Omer
Lis Mieszko
Shim Keun Sup
Publication date: 01/01/2012

Chip-multiprocessors (CMPs) have become the mainstream chip design in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Architecture (NUCA) design, where on-chip access latencies depend on the physical distances between requesting cores and home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardware-level thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides at the instruction level whether to perform a remote access (as in traditional NUCA designs) or a thread migration. For a set of parallel benchmarks, our thread migration predictor improves performance by 18% on average and at best by 2.3X over the standard NUCA design that only uses remote accesses.

CiteSeerX

DSpace@MIT
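
The trade-off described in the abstract above (several remote round trips versus one migration) can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the cost constants, the function names, and the greedy look-ahead policy are all hypothetical, and it is not the paper's on-line predictor, which cannot see future accesses.

```python
from itertools import groupby

# Hypothetical cost units; the abstract gives no concrete numbers.
REMOTE_ROUND_TRIP = 2   # cost of one remote cache access (request + reply)
MIGRATION = 5           # cost of moving the thread to another core


def remote_only_cost(trace, native_core):
    """Cost when the thread never moves: every access whose home core
    differs from the native core pays one remote round trip."""
    return sum(REMOTE_ROUND_TRIP for home in trace if home != native_core)


def greedy_run_cost(trace, native_core):
    """Cost when, for each run of consecutive accesses to the same home
    core, the cheaper of 'n remote accesses' or 'one migration' is chosen.
    This peeks at the run length, which a real predictor cannot do."""
    current = native_core
    cost = 0
    for home, run in groupby(trace):
        n = len(list(run))
        if home == current:
            continue  # local accesses are treated as free in this model
        remote = n * REMOTE_ROUND_TRIP
        if MIGRATION < remote:
            cost += MIGRATION
            current = home  # the thread now runs on the home core
        else:
            cost += remote
    return cost


# Each entry is the home core of one memory access; the thread starts on core 0.
trace = [1, 1, 1, 1, 0, 2, 2, 2, 2, 2]
print(remote_only_cost(trace, native_core=0))
print(greedy_run_cost(trace, native_core=0))
```

With these made-up costs, long runs of accesses to one home core favor migration, while an isolated access is cheaper to serve remotely.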

The Execution Migration Machine: Directoryless Shared-Memory Architecture

Author: Devadas Srinivas
Khan Omer
Lis Mieszko
Shim Keun Sup
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/09/2015

For certain applications involving chip multiprocessors with more than 16 cores, a directoryless architecture with fine-grained and partial-context thread migration can outperform directory-based coherence, providing lighter on-chip traffic and reduced verification complexity

DSpace@MIT

Guaranteed in-order packet delivery using Exclusive Dynamic Virtual Channel Allocation

Author: Cho Myong Hyon
Devadas Srinivas
Lis Mieszko
Shim Keun Sup
Publication date: 18/08/2009

In-order packet delivery, a critical abstraction for many higher-level protocols, can severely limit the performance potential in low-latency networks (common, for example, in network-on-chip designs with many cores). While basic variants of dimension-order routing guarantee in-order delivery, improving performance by adding multiple dynamically allocated virtual channels or using other routing schemes compromises this guarantee. Although this can be addressed by reordering out-of-order packets at the destination core, such schemes incur significant overheads, and, in the worst case, raise the specter of deadlock or require expensive retransmission. We present Exclusive Dynamic VCA, an oblivious virtual channel allocation scheme which combines the performance advantages of dynamic virtual allocation with in-network, deadlock-free in-order delivery. At the same time, our scheme reduces head-of-line blocking, often significantly improving throughput compared to equivalent baseline (out-of-order) dimension-order routing when multiple virtual channels are used, and so may be desirable even when in-order delivery is not required. Implementation requires only minor, inexpensive changes to traditional oblivious dimension-order router architectures, more than offset by the removal of packet reorder buffers and logic

DSpace@MIT
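
As background for the abstract above, basic dimension-order routing on a 2D mesh can be sketched as follows. This is a generic XY-routing illustration, not the Exclusive Dynamic VCA scheme itself; the function name and the coordinate convention are assumptions made for the example.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst on a 2D mesh,
    correcting the X coordinate completely before touching Y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: travel along the X dimension.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: travel along the Y dimension.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the route depends only on the source and destination, every packet of a flow follows the same path; with a single FIFO channel per link, packets cannot overtake one another. Once several dynamically allocated virtual channels share a link, that ordering is no longer automatic, which is the problem the abstract addresses.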

Library Cache Coherence

Author: Cho Myong Hyon
Devadas Srinivas
Khan Omer
Lis Mieszko
Shim Keun Sup
Publication date: 02/05/2011

Directory-based cache coherence is a popular mechanism for chip multiprocessors and multicores. The directory protocol, however, requires multicast for invalidation messages and the collection of acknowledgement messages, which can be expensive in terms of latency and network traffic. Furthermore, the size of the directory increases with the number of cores. We present Library Cache Coherence (LCC), which requires neither broadcast/multicast for invalidations nor waiting for invalidation acknowledgements. A library is a set of timestamps that are used to auto-invalidate shared cache lines, and delay writes on the lines until all shared copies expire. The size of the library is independent of the number of cores. By removing the complex invalidation process of directory-based cache coherence protocols, LCC generates fewer network messages. At the same time, LCC also allows reads on a cache block to take place while a write to the block is being delayed, without breaking sequential consistency. As a result, LCC has 1.85X less average memory latency than a MESI directory-based protocol on our set of benchmarks, even with a simple timestamp choosing algorithm; moreover, our experimental results on LCC with an ideal timestamp scheme (though not implementable) show the potential of further improvement for LCC with more sophisticated timestamp schemes.

DSpace@MIT
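
The timestamp idea in the abstract above (shared copies expire on their own, and writes wait until they have) can be sketched with a single line of state. This is a minimal illustration under assumptions: the class name, the fixed `lease` parameter, and the rule that a copy is valid while `now < expiry` are all hypothetical and not taken from the paper.

```python
class LibraryLine:
    """One cache line at its home core, tracking only the latest expiry
    timestamp handed out, regardless of how many cores hold a copy."""

    def __init__(self, value=0):
        self.value = value
        self.max_expiry = 0  # latest timestamp given to any reader

    def read(self, now, lease):
        """Hand out a read-only copy that auto-invalidates at now + lease."""
        expiry = now + lease
        self.max_expiry = max(self.max_expiry, expiry)
        return self.value, expiry

    def write(self, now, value):
        """Apply the write only once every shared copy has expired.
        Returns False if the write must be delayed."""
        if now < self.max_expiry:
            return False
        self.value = value
        return True


line = LibraryLine(value=7)
print(line.read(now=10, lease=5))   # (7, 15): copy valid until time 15
print(line.write(now=12, value=9))  # False: a shared copy is still live
print(line.read(now=13, lease=1))   # (7, 14): reads proceed while the write waits
print(line.write(now=15, value=9))  # True: all copies have expired
```

No invalidation messages or acknowledgements appear in the sketch: readers discard their copies when the timestamp passes, and the stored state per line does not grow with the number of cores.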

Scalable directoryless shared memory coherence using execution migration

Author: Cho Myong Hyon
Devadas Srinivas
Khan Omer
Lis Mieszko
Shim Keun Sup
Publication date: 22/11/2010

We introduce the concept of deadlock-free migration-based coherent shared memory to the NUCA family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using an execution migration (EM) architecture, we achieve performance comparable to directory-based architectures without using directories: avoiding automatic data replication significantly reduces cache miss rates, while a fast network-level thread migration scheme takes advantage of shared data locality to reduce remote cache accesses that limit traditional NUCA performance. EM area and energy consumption are very competitive, and, on average, it outperforms a directory-based MOESI baseline by 6.8% and a traditional S-NUCA design by 9.2%. We argue that with EM, scaling performance has much lower cost and design complexity than in directory-based coherence and traditional NUCA architectures: by merely scaling network bandwidth from 128 to 256 (512) bit flits, the performance of our architecture improves by an additional 8% (12%), while the baselines show negligible improvement.

DSpace@MIT

Deadlock-free fine-grained thread migration

Author: Keun Sup Shim
Mieszko Lis
Myong Hyon Cho
Omer Khan
Srinivas Devadas
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2011

Several recent studies have proposed fine-grained, hardware-level thread migration in multicores as a solution to power, reliability, and memory coherence problems. The need for fast thread migration has been well documented; however, a fast, deadlock-free migration protocol is sorely lacking: existing solutions either deadlock or are too slow and cumbersome to ensure performance with frequent, fine-grained thread migrations. In this study, we introduce the Exclusive Native Context (ENC) protocol, a general, provably deadlock-free migration protocol for instruction-level thread migration architectures. Simple to implement, ENC does not require additional hardware beyond common migration-based architectures. Our evaluation using synthetic migrations and the SPLASH-2 application suite shows that ENC offers performance within 11.7% of an idealized deadlock-free migration protocol with infinite resources.

CiteSeerX

DSpace@MIT

Crossref

Design tradeoffs for simplicity and efficient verification in the Execution Migration Machine

Author: Cho Myong Hyon
Devadas Srinivas
Lebedev Ilia A.
Lis Mieszko
Shim Keun Sup
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/10/2013

As transistor technology continues to scale, the architecture community has experienced exponential growth in design complexity and significantly increasing implementation and verification costs. Moreover, Moore's law has led to a ubiquitous trend of an increasing number of cores on a single chip. Often, these large-core-count chips provide a shared memory abstraction via directories and coherence protocols, which have become notoriously error-prone and difficult to verify because of subtle data races and state space explosion. Although a very simple hardware shared memory implementation can be achieved by simply not allowing ad-hoc data replication and relying on remote accesses for remotely cached data (i.e., requiring no directories or coherence protocols), such remote-access-based directoryless architectures cannot take advantage of any data locality, and therefore suffer in both performance and energy. Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM²), establishes a new design point. On the one hand, EM² supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remote-access-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM² chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM² is a fairly large design (110 cores using a total of 357 million transistors), the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 man-months.

CiteSeerX

DSpace@MIT

Crossref

Inhibition of cAMP/PKA Pathway Protects Optic Nerve Head Astrocytes against Oxidative Stress by Akt/Bax Phosphorylation-Mediated Mfn1/2 Oligomerization.

Author: Ahn Sangphil
Edwards Genea
Ju Won-Kyu
Kim Keun-Young
Park Tae Lim
Shim Myoung Sup
Weinreb Robert N
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Glaucoma is characterized by a progressive optic nerve degeneration and retinal ganglion cell loss, but the underlying biological basis for the accompanying neurodegeneration is not known. Accumulating evidence indicates that structural and functional abnormalities of astrocytes within the optic nerve head (ONH) have a role in glaucomatous neurodegeneration. Here, we investigate the impact of activation of cyclic adenosine 3',5'-monophosphate (cAMP)/protein kinase A (PKA) pathway on mitochondrial dynamics of ONH astrocytes exposed to oxidative stress. ONH astrocytes showed a significant loss of astrocytic processes in the glial lamina of glaucomatous DBA/2J mice, accompanied by basement membrane thickening and collagen deposition in blood vessels and axonal degeneration. Serial block-face scanning electron microscopy data analysis demonstrated that numbers of total and branched mitochondria were significantly increased in ONH astrocytes, while mitochondrial length and volume density were significantly decreased. We found that hydrogen peroxide- (H2O2-) induced oxidative stress compromised not only mitochondrial bioenergetics by reducing the basal and maximal respiration but also balance of mitochondrial dynamics by decreasing dynamin-related protein 1 (Drp1) protein expression in rat ONH astrocytes. In contrast, elevated cAMP by dibutyryl-cAMP (dbcAMP) or isobutylmethylxanthine treatment significantly increased Drp1 protein expression in ONH astrocytes. Elevated cAMP exacerbated the impairment of mitochondrial dynamics and reduction of cell viability to oxidative stress in ONH astrocytes by decreasing optic atrophy type 1 (OPA1), and mitofusin (Mfn)1/2 protein expression. Following combined treatment with H2O2 and dbcAMP, PKA inhibition restored mitochondrial dynamics by increasing mitochondrial length and decreasing mitochondrial number, and this promoted cell viability in ONH astrocytes. 
Also, PKA inhibition significantly promoted Akt/Bax phosphorylation and Mfn1/2 oligomerization in ONH astrocytes. These results suggest that modulation of the cAMP/PKA signaling pathway may have therapeutic potential by activating Akt/Bax phosphorylation and promoting Mfn1/2 oligomerization in glaucomatous ONH astrocytes

eScholarship - University of California

Static virtual channel allocation in oblivious routing

Author: Cho Myong Hyon
Devadas Srinivas
Kinsy Michel A.
Lis Mieszko
Shim Keun Sup
Wen Tina
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

Most virtual channel routers have multiple virtual channels to mitigate the effects of head-of-line blocking. When there are more flows than virtual channels at a link, packets or flows must compete for channels, either in a dynamic way at each link or by static assignment computed before transmission starts. In this paper, we present methods that statically allocate channels to flows at each link when oblivious routing is used, and ensure deadlock freedom for arbitrary minimal routes when two or more virtual channels are available. We then experimentally explore the performance trade-offs of static and dynamic virtual channel allocation for various oblivious routing methods, including DOR, ROMM, Valiant and a novel bandwidth-sensitive oblivious routing scheme (BSORM). Through judicious separation of flows, static allocation schemes often exceed the performance of dynamic allocation schemes

DSpace@MIT

Crossref

Scalable, accurate multicore simulation in the 1000-core era

Author: Cho Myong Hyon
Devadas Srinivas
Fletcher Christopher Wardlaw
Khan Omer
Lis Mieszko
Ren Pengju
Shim Keun Sup
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/04/2011

We present HORNET, a parallel, highly configurable, cycle-level multicore simulator based on an ingress-queued wormhole router NoC architecture. The parallel simulation engine offers cycle-accurate as well as periodic synchronization; while preserving functional accuracy, this permits tradeoffs between perfect timing accuracy and high speed with very good accuracy. When run on 6 separate physical cores on a single die, speedups can exceed a factor of 5, and when run on a two-die 12-core system with 2-way hyperthreading, speedups exceed 11×. Most hardware parameters are configurable, including memory hierarchy, interconnect geometry, bandwidth, crossbar dimensions, and parameters driving power and thermal effects. A highly parametrized table-based NoC design allows a variety of routing and virtual channel allocation algorithms out of the box, ranging from simple DOR routing to complex Valiant, ROMM, or PROM schemes, BSOR, and adaptive routing. HORNET can run in network-only mode using synthetic traffic or traces, directly emulate a MIPS-based multicore, or function as the memory subsystem for native applications executed under the Pin instrumentation tool. HORNET is freely available under the open-source MIT license at http://csg.csail.mit.edu/hornet/

DSpace@MIT

Crossref